On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems

Authors

  • Huizhen Yu
  • Dimitri P. Bertsekas
Abstract

We consider a totally asynchronous stochastic approximation algorithm, Q-learning, for solving finite space stochastic shortest path (SSP) problems, which are total cost Markov decision processes with an absorbing and cost-free state. For the most commonly used SSP models, existing convergence proofs assume that the sequence of Q-learning iterates is bounded with probability one, or some other condition that guarantees boundedness. We prove that the sequence of iterates is naturally bounded with probability one, thus furnishing the boundedness condition in the convergence proof by Tsitsiklis [Tsi94] and establishing completely the convergence of Q-learning for these SSP models.

Mar 2011; revised Sep 2011, Apr 2012
∗ Laboratory for Information and Decision Systems (LIDS), M.I.T. (janey [email protected])
† Laboratory for Information and Decision Systems (LIDS) and Dept. EECS, M.I.T. ([email protected])
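In the standard formulation referenced here, the asynchronous Q-learning iteration for an SSP updates one state–action pair at a time via Q(i,u) ← (1−γ)Q(i,u) + γ[g(i,u,j) + min_v Q(j,v)], with the Q-factors of the cost-free absorbing state held at zero. The following is a minimal sketch of this update on a small randomly generated SSP; the transition model, costs, stepsize rule, and exploration scheme are illustrative assumptions, not taken from the paper.

    import numpy as np

    # Minimal sketch of asynchronous Q-learning for a finite SSP
    # (total cost, no discounting, state 0 is absorbing and cost-free).
    # The model below is a made-up example; only the update rule matches
    # the algorithm discussed in the abstract.

    rng = np.random.default_rng(0)

    n_states, n_actions = 4, 2          # state 0 is the termination state
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    P[0, :, :] = 0.0
    P[0, :, 0] = 1.0                    # termination state is absorbing
    g = rng.uniform(1.0, 2.0, size=(n_states, n_actions))
    g[0, :] = 0.0                       # and cost-free

    Q = np.zeros((n_states, n_actions))
    visits = np.zeros_like(Q)

    state = 1
    for _ in range(200_000):
        u = rng.integers(n_actions)                 # exploratory action choice
        next_state = rng.choice(n_states, p=P[state, u])
        visits[state, u] += 1
        alpha = 1.0 / visits[state, u]              # diminishing stepsize
        target = g[state, u] + Q[next_state].min()  # Q(0, .) stays at zero
        Q[state, u] += alpha * (target - Q[state, u])
        # restart from a random nonterminal state upon termination
        state = next_state if next_state != 0 else rng.integers(1, n_states)

    print("Estimated optimal total costs per state:", Q.min(axis=1))

In this random example every stationary policy reaches the termination state with probability one, so the iterates stay bounded and converge to the optimal total costs, consistent with the boundedness result established in the paper.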


Similar Papers

Stochastic Shortest Path Games and Q-Learning

We consider a class of two-player zero-sum stochastic games with finite state and compact control spaces, which we call stochastic shortest path (SSP) games. They are total cost stochastic dynamic games that have a cost-free termination state. Based on their close connection to single-player SSP problems, we introduce model conditions that characterize a general subclass of these games that have...

Stochastic approximation for non-expansive maps: application to Q-learning algorithms

We discuss synchronous and asynchronous iterations of the form x_{k+1} = x_k + γ(k)(h(x_k) + w_k), where h is a suitable map and {w_k} is a deterministic or stochastic sequence satisfying suitable conditions. In particular, in the stochastic case, these are stochastic approximation iterations that can be analyzed using the ODE approach based either on Kushner and Clark's lemma for the synchronous case or on...
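As a concrete illustration of the stochastic case, the sketch below runs the synchronous iteration x_{k+1} = x_k + γ(k)(h(x_k) + w_k) for a simple affine map h and i.i.d. zero-mean noise; the specific map, noise distribution, and stepsize γ(k) = 1/(k+1) are assumptions made only for this example.

    import numpy as np

    # Sketch of a synchronous stochastic approximation iteration
    #   x_{k+1} = x_k + gamma(k) * (h(x_k) + w_k)
    # with an illustrative affine map h whose root (h(x*) = 0) is x* = b,
    # and i.i.d. zero-mean Gaussian noise w_k.

    rng = np.random.default_rng(1)
    b = np.array([1.0, -2.0, 0.5])

    def h(x):
        # h(x) = b - x, so the iteration tracks the ODE dx/dt = b - x, with x* = b
        return b - x

    x = np.zeros(3)
    for k in range(100_000):
        gamma = 1.0 / (k + 1)                  # diminishing stepsize
        w = rng.normal(scale=0.5, size=3)      # zero-mean noise
        x = x + gamma * (h(x) + w)

    print("iterate after 1e5 steps:", x)       # should be close to b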

Stochastic Approximation for Non-Expansive Maps: Application to Q-Learning Algorithms

We discuss synchronous and asynchronous variants of fixed point iterations of the form x_{k+1} = x_k + γ(k)(F(x_k, ξ_k) − x_k), where F is a non-expansive mapping under a suitable norm, and {ξ_k} is a stochastic sequence. These are stochastic approximation iterations that can be analyzed using the ODE approach based either on Kushner and Clark's Lemma for the synchronous case or Borkar's Theorem fo...
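To contrast with the synchronous case above, this sketch runs an asynchronous variant of x_{k+1} = x_k + γ(k)(F(x_k, ξ_k) − x_k), updating one randomly selected component per step; the particular non-expansive map F (a sup-norm contraction here), the component-wise stepsizes, and the omission of the noise term are simplifying assumptions for the example.

    import numpy as np

    # Sketch of an asynchronous fixed point iteration
    #   x_{k+1}(i) = x_k(i) + gamma_i(k) * (F_i(x_k) - x_k(i))   for one component i,
    # where F(x) = 0.9 * A @ x + c with A row-stochastic, so F is a sup-norm
    # contraction (hence non-expansive). Noise is omitted for brevity.

    rng = np.random.default_rng(2)
    n = 5
    A = rng.dirichlet(np.ones(n), size=n)      # row-stochastic matrix
    c = rng.uniform(-1.0, 1.0, size=n)

    def F(x):
        return 0.9 * A @ x + c

    x = np.zeros(n)
    counts = np.zeros(n)
    for _ in range(200_000):
        i = rng.integers(n)                    # component chosen asynchronously
        counts[i] += 1
        gamma = 1.0 / counts[i]                # component-wise diminishing stepsize
        x[i] += gamma * (F(x)[i] - x[i])

    print("asynchronous iterate:", x)
    print("fixed point residual:", np.max(np.abs(F(x) - x)))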

Boundedness of iterates in Q-Learning

Reinforcement Learning (RL) is a simulation-based counterpart of stochastic dynamic programming. In recent years, it has been used in solving complex Markov decision problems (MDPs). Watkins’ Q-Learning is by far the most popular RL algorithm used for solving discounted-reward MDPs. The boundedness of the iterates in Q-Learning plays a critical role in its convergence analysis and in making the...

Q-learning and policy iteration algorithms for stochastic shortest path problems

We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in Bertsekas and Yu (Math. Oper. Res. 37(1...
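For background on the classical ingredient in this combination, the following is a minimal sketch of exact Q-value iteration for a finite SSP; the small random model is a made-up example, and this is the standard baseline the abstract alludes to, not the new policy-iteration/Q-learning hybrids proposed in the paper.

    import numpy as np

    # Classical Q-value iteration for a finite SSP (total cost, termination state 0).
    # This is the standard baseline, not the authors' new algorithms.

    rng = np.random.default_rng(3)
    n_states, n_actions = 4, 2
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    P[0, :, :] = 0.0
    P[0, :, 0] = 1.0                             # state 0 absorbing
    g = rng.uniform(1.0, 2.0, size=(n_states, n_actions))
    g[0, :] = 0.0                                # and cost-free

    Q = np.zeros((n_states, n_actions))
    for _ in range(1000):
        # (FQ)(i,u) = g(i,u) + sum_j P(j|i,u) * min_v Q(j,v)
        Q_new = g + P @ Q.min(axis=1)
        Q_new[0, :] = 0.0                        # keep termination Q-factors at zero
        if np.max(np.abs(Q_new - Q)) < 1e-10:
            break
        Q = Q_new

    print("optimal total costs per state:", Q.min(axis=1))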


Journal:
  • Math. Oper. Res.

Volume 38, Issue

Pages  -

Publication date: 2013